From the very basics
This is an introductory course and we expect no prior knowledge
Adjust your speed to your own level
Ask questions at any time
Tell us if you are bored or overwhelmed
Introduction to R and RStudio
Data analyses cookbook
| Time | Topic |
|---|---|
| 27th 9.15 - 11.00 | Introduction to R, RStudio, functions, Rscript |
| 27th 11.00 - 11.15 | Coffee break |
| 27th 11.15 - 12.00 | Create simple plots with base R |
| 27th 12.00 - 13.00 | Lunch break |
| 27th 13.00 - 13.45 | How to import data |
| 27th 13.45 - 14.45 | Data inspection |
| 27th 14.45 - 15.00 | Coffee break |
| 27th 15.00 - 16.55 | Preparing data for analysis |
| 27th 16.55 - 17.00 | Explain graded exercise |
| 28th 9.15 - 12.15 | Repetition of Monday & Work on graded exercise |
There will be a final graded exercise (pass/fail, 0.5 ECTS).
We will explain the exercise at the end of today and you can work on it tomorrow morning.
The deadline for handing in the exercise is the 10th of February.
Did you all manage to install R and RStudio?
Now, you can download, save and open the “follow_along_script.R”
“#” can be used to add text and comments
Why could this be helpful?
Input(s) are called “arguments”
To find out what arguments the function requires
c combines its arguments
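As a small illustration of functions and their arguments, the sketch below calls round() with a named argument, lists a function's arguments with args(), and combines values with c(). The numbers and the contents of a_vector are made up for this example and are not taken from the follow-along script.

```r
# round() takes the number to round and, optionally, the number of digits
rounded <- round(3.14159, digits = 2)
rounded

# Show which arguments a function accepts
# (typing ?round in the console opens the full help page)
args(round)

# c() combines its arguments into a vector
# (the values are made up for this example)
a_vector <- c(2, 4, 6, 8)
a_vector
length(a_vector)
```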
How to access an element in the vector
How to access an element from a matrix
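A minimal sketch of indexing with square brackets, assuming a made-up vector and matrix. Positions start at 1, and a matrix is indexed as [row, column].

```r
# A vector with made-up values
a_vector <- c(10, 20, 30, 40)
a_vector[2]        # the second element
a_vector[c(1, 3)]  # the first and third element

# A matrix with 2 rows and 3 columns, filled column by column
a_matrix <- matrix(1:6, nrow = 2, ncol = 3)
a_matrix[2, 3]  # the element in row 2, column 3
a_matrix[1, ]   # the whole first row
a_matrix[, 2]   # the whole second column
```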
Can you calculate the following for “a_vector”?
Can you figure out what you can do with the following functions?
Numeric and character
Logical
factor
Lists
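The data types listed above can be sketched as follows. All object names and values are made up for illustration; class() reports the type of each object.

```r
num <- c(1.5, 2, 3)                      # numeric
chr <- c("a", "b", "c")                  # character
lgl <- c(TRUE, FALSE, TRUE)              # logical
fct <- factor(c("low", "high", "low"))   # factor: categories with fixed levels
lst <- list(number = 1, text = "a", flag = TRUE)  # list: elements of different types

class(num)
class(chr)
class(lgl)
class(fct)
levels(fct)  # levels are sorted alphabetically by default
class(lst)

# A data frame combines equally long vectors as columns
df_example <- data.frame(x = 1:3, y = c("a", "b", "c"))
df_example
```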
x y
1 1 a
2 2 b
3 3 c
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
11 11 28
12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
Help on topic 'plot' was found in the following packages:
Package Library
graphics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
base /Library/Frameworks/R.framework/Resources/library
Using the first match ...
Get access to specific set of functions
run “library()” every time you want to use any function from this package
Using the “Help” tab in the Output pane and typing the package name into the search field, which will provide you with a brief description
Search for package details online in CRAN
Getting the help documentation of the package, which lists all functions and their description
Install the package “pacman”
open the help page for “pacman”
You can use pacman function “p_load” to load multiple packages at the same time.
Import and inspect data in R
Learn the basics of data cleaning and wrangling using tidyverse
Implement basic operations and summaries in R
Gain hands-on experience through exercises
Data on Covid vaccinations from Basel
National survey on health and nutrition in the US (https://wwwn.cdc.gov/nchs/nhanes/)
Before downloading data, we organize our directory
Essential steps to set up your R environment include
Set directories
Load required packages
01_oridata: Here you store all original files. DO NOT ALTER THIS DATA!
02_data: Altered data.
03_code: Here you should store all your R-script / markdown / qmd-files etc. In this class, we only work with one script.
04_output: Here you store all your output files (e.g., tables, figures).
Use a folder structure, not your desktop!
Copy the code below into your R script.
Replace “d_proj” with the path to your folder.
Run the code + check if result is correct.
Save the script with a name of your choice in the folder “03_code”.
# Set the project root directory
d_proj <- "/Users/jb22m516/Documents/GitHub/getting_started_with_R/lesson_material/exercises"
# Basic folders
# Directory for original data
d_oridata <- file.path(d_proj, "01_oridata")
# Directory for edited data
d_data <- file.path(d_proj, "02_data")
# Directory for R scripts
d_code <- file.path(d_proj, "03_code")
# Directory for R output
d_output <- file.path(d_proj, "04_output")
## This code below creates the folders in case you haven't set them up yet
# Create a vector with all directories
dirs <- c(d_oridata, d_data, d_code, d_output)
# Loop through the directories and create them if they don't exist
for (dir in dirs) {
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
}
Download “nhanes_for_R.xlsx” and save it to your folder “01_oridata”.
You can specify several options within read_excel():
# read_excel() comes from the readxl package
library(readxl)
# sheet: Specify the sheet name or number.
nhanes <- read_excel(file.path(d_oridata, "nhanes_for_R.xlsx"), sheet = 1)
# range: Import a specific range of cells.
nhanes <- read_excel(file.path(d_oridata, "nhanes_for_R.xlsx"), range = "A1:D100")
# col_names: Specify if the first row contains column names.
# TRUE (default): First row is used as column names.
# FALSE: readxl assigns default column names (...1, ...2, etc.).
nhanes <- read_excel(file.path(d_oridata, "nhanes_for_R.xlsx"), col_names = FALSE)
# skip: Skip the first n rows
nhanes <- read_excel(file.path(d_oridata, "nhanes_for_R.xlsx"), skip = 3)
# na: Define missing values.
nhanes <- read_excel(file.path(d_oridata, "nhanes_for_R.xlsx"), na = c("NA", "99"))
There are other packages and functions for importing other data formats. The most common are:
csv-files: read.csv() (Base R) or read_csv() (readr)
Stata, SAS, SPSS (haven): read_dta() (Stata files), read_sav() (SPSS files), read_sas() (SAS files)
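As a sketch of the csv case, the example below writes a tiny made-up table to a temporary file and reads it back with base R's read.csv(). The temporary file only stands in for a real csv-file in your “01_oridata” folder.

```r
# Write a tiny example table to a temporary csv-file (stand-in for a real file)
tmp_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, weight = c(70.5, NA, 82.1)),
  tmp_file,
  row.names = FALSE
)

# Import the file with base R; "NA" entries are read as missing values
example_csv <- read.csv(tmp_file)
str(example_csv)
```

With the readr package loaded, read_csv(tmp_file) reads the same file and returns a tibble instead of a plain data frame.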
Download the data “BMX_J.xpt” from GitHub and put it into your “01_oridata” folder. This is the original data file from the NHANES dataset that contains all body measures (e.g., height, weight).
Find out which package you need to import an xpt-file.
Import the file in R and assign it to the object “nhanes_body”.
Dimensions: Check number of rows and columns
Column names: List the names of all variables
Data structure: Provides overview of dataset structure, including variable types and the first few observations
tibble [9,254 × 14] (S3: tbl_df/tbl/data.frame)
$ SEQN : num [1:9254] 93703 93704 93705 93706 93707 ...
$ RIAGENDR: num [1:9254] 2 1 2 1 1 2 2 2 1 1 ...
$ RIDAGEYR: num [1:9254] 2 2 66 18 13 66 75 0 56 18 ...
$ RIDRETH1: num [1:9254] 5 3 4 5 5 5 4 3 5 1 ...
$ DMDEDUC2: num [1:9254] NA NA 2 NA NA 1 4 NA 5 NA ...
$ INDHHIN2: num [1:9254] 15 15 3 NA 10 6 2 15 15 4 ...
$ BMXWT : num [1:9254] 13.7 13.9 79.5 66.3 45.4 53.5 88.8 10.2 62.1 58.9 ...
$ BMXHT : num [1:9254] 88.6 94.2 158.3 175.7 158.4 ...
$ BPXSY1 : num [1:9254] NA NA NA 112 128 NA 120 NA 108 112 ...
$ BPXDI1 : num [1:9254] NA NA NA 74 38 NA 66 NA 68 68 ...
$ BPXPLS : num [1:9254] NA NA 52 82 100 68 74 NA 62 68 ...
$ SMQ020 : num [1:9254] NA NA 1 2 NA 2 1 NA 2 1 ...
$ SMQ040 : num [1:9254] NA NA 3 NA NA NA 1 NA NA 2 ...
$ SMQ900 : num [1:9254] NA NA 2 2 NA 2 1 NA 2 1 ...
Rows: 9,254
Columns: 14
$ SEQN <dbl> 93703, 93704, 93705, 93706, 93707, 93708, 93709, 93710, 93711…
$ RIAGENDR <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1…
$ RIDAGEYR <dbl> 2, 2, 66, 18, 13, 66, 75, 0, 56, 18, 67, 54, 71, 61, 22, 45, …
$ RIDRETH1 <dbl> 5, 3, 4, 5, 5, 5, 4, 3, 5, 1, 3, 4, 5, 5, 3, 4, 3, 4, 1, 3, 3…
$ DMDEDUC2 <dbl> NA, NA, 2, NA, NA, 1, 4, NA, 5, NA, 3, 4, 3, 5, 3, 3, NA, NA,…
$ INDHHIN2 <dbl> 15, 15, 3, NA, 10, 6, 2, 15, 15, 4, 6, 7, 8, 15, NA, 10, 14, …
$ BMXWT <dbl> 13.7, 13.9, 79.5, 66.3, 45.4, 53.5, 88.8, 10.2, 62.1, 58.9, 7…
$ BMXHT <dbl> 88.6, 94.2, 158.3, 175.7, 158.4, 150.2, 151.1, NA, 170.6, 172…
$ BPXSY1 <dbl> NA, NA, NA, 112, 128, NA, 120, NA, 108, 112, 104, NA, 112, 12…
$ BPXDI1 <dbl> NA, NA, NA, 74, 38, NA, 66, NA, 68, 68, 70, NA, 60, 72, 62, 8…
$ BPXPLS <dbl> NA, NA, 52, 82, 100, 68, 74, NA, 62, 68, 90, 90, 66, 58, 60, …
$ SMQ020 <dbl> NA, NA, 1, 2, NA, 2, 1, NA, 2, 1, 1, 1, 1, 1, 1, 2, NA, NA, 2…
$ SMQ040 <dbl> NA, NA, 3, NA, NA, NA, 1, NA, NA, 2, 1, 3, 1, 3, 1, NA, NA, N…
$ SMQ900 <dbl> NA, NA, 2, 2, NA, 2, 1, NA, 2, 1, 2, 2, 1, 2, 1, 2, NA, NA, 2…
Quick summary of each variable: Gives you summary statistics for each column, including missings
SEQN RIAGENDR RIDAGEYR RIDRETH1
Min. : 93703 Min. :1.000 Min. : 0.00 Min. :1.000
1st Qu.: 96016 1st Qu.:1.000 1st Qu.:11.00 1st Qu.:3.000
Median : 98330 Median :2.000 Median :31.00 Median :3.000
Mean : 98330 Mean :1.508 Mean :34.33 Mean :3.234
3rd Qu.:100643 3rd Qu.:2.000 3rd Qu.:58.00 3rd Qu.:4.000
Max. :102956 Max. :2.000 Max. :80.00 Max. :5.000
DMDEDUC2 INDHHIN2 BMXWT BMXHT
Min. :1.000 Min. : 1.0 Min. : 3.20 Min. : 78.3
1st Qu.:3.000 1st Qu.: 6.0 1st Qu.: 43.10 1st Qu.:151.4
Median :4.000 Median : 8.0 Median : 67.75 Median :161.9
Mean :3.526 Mean :12.5 Mean : 65.14 Mean :156.6
3rd Qu.:4.000 3rd Qu.:14.0 3rd Qu.: 85.60 3rd Qu.:171.2
Max. :9.000 Max. :99.0 Max. :242.60 Max. :197.7
NA's :3685 NA's :491 NA's :674 NA's :1238
BPXSY1 BPXDI1 BPXPLS SMQ020
Min. : 72.0 Min. : 0.00 Min. : 34.00 Min. :1.000
1st Qu.:106.0 1st Qu.: 60.00 1st Qu.: 66.00 1st Qu.:1.000
Median :118.0 Median : 70.00 Median : 72.00 Median :2.000
Mean :121.3 Mean : 67.84 Mean : 73.75 Mean :1.597
3rd Qu.:132.0 3rd Qu.: 76.00 3rd Qu.: 82.00 3rd Qu.:2.000
Max. :228.0 Max. :136.00 Max. :136.00 Max. :2.000
NA's :2952 NA's :2952 NA's :2512 NA's :3398
SMQ040 SMQ900
Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:2.000
Median :3.000 Median :2.000
Mean :2.226 Mean :1.805
3rd Qu.:3.000 3rd Qu.:2.000
Max. :3.000 Max. :9.000
NA's :6895 NA's :3398
Data preview: View the first few or last few rows of your dataset
# A tibble: 6 × 14
SEQN RIAGENDR RIDAGEYR RIDRETH1 DMDEDUC2 INDHHIN2 BMXWT BMXHT BPXSY1 BPXDI1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 93703 2 2 5 NA 15 13.7 88.6 NA NA
2 93704 1 2 3 NA 15 13.9 94.2 NA NA
3 93705 2 66 4 2 3 79.5 158. NA NA
4 93706 1 18 5 NA NA 66.3 176. 112 74
5 93707 1 13 5 NA 10 45.4 158. 128 38
6 93708 2 66 5 1 6 53.5 150. NA NA
# ℹ 4 more variables: BPXPLS <dbl>, SMQ020 <dbl>, SMQ040 <dbl>, SMQ900 <dbl>
# A tibble: 6 × 14
SEQN RIAGENDR RIDAGEYR RIDRETH1 DMDEDUC2 INDHHIN2 BMXWT BMXHT BPXSY1 BPXDI1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 102951 1 4 3 NA 10 23.8 109. NA NA
2 102952 2 70 5 3 4 49 156. 136 74
3 102953 1 42 1 3 12 97.4 165. 124 76
4 102954 2 41 4 5 10 69.1 163. 116 66
5 102955 2 14 4 NA 9 112. 157. 114 62
6 102956 1 38 3 4 7 112. 176. 150 98
# ℹ 4 more variables: BPXPLS <dbl>, SMQ020 <dbl>, SMQ040 <dbl>, SMQ900 <dbl>
# You can add the argument n to specify how many of the first/last rows you want to see:
head(nhanes, n = 10)
# A tibble: 10 × 14
SEQN RIAGENDR RIDAGEYR RIDRETH1 DMDEDUC2 INDHHIN2 BMXWT BMXHT BPXSY1 BPXDI1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 93703 2 2 5 NA 15 13.7 88.6 NA NA
2 93704 1 2 3 NA 15 13.9 94.2 NA NA
3 93705 2 66 4 2 3 79.5 158. NA NA
4 93706 1 18 5 NA NA 66.3 176. 112 74
5 93707 1 13 5 NA 10 45.4 158. 128 38
6 93708 2 66 5 1 6 53.5 150. NA NA
7 93709 2 75 4 4 2 88.8 151. 120 66
8 93710 2 0 3 NA 15 10.2 NA NA NA
9 93711 1 56 5 5 15 62.1 171. 108 68
10 93712 1 18 1 NA 4 58.9 173. 112 68
# ℹ 4 more variables: BPXPLS <dbl>, SMQ020 <dbl>, SMQ040 <dbl>, SMQ900 <dbl>
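The inspection steps in this part can be tried without any download on the built-in cars dataset (50 rows, 2 columns), which is used here only as a stand-in for nhanes. glimpse() is not shown because it needs the dplyr package.

```r
# cars ships with R, so this runs without importing anything
dim(cars)            # number of rows and columns
nrow(cars)           # number of rows only
ncol(cars)           # number of columns only
colnames(cars)       # names of all variables
str(cars)            # structure: variable types and first values
summary(cars)        # summary statistics per column
head(cars, n = 3)    # first 3 rows
tail(cars, n = 3)    # last 3 rows
class(cars)          # class of the whole dataset
class(cars$speed)    # class of a single column
sapply(cars, class)  # class of every column
```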
Check class / variable type of your dataset / columns
[1] "tbl_df" "tbl" "data.frame"
[1] "numeric"
SEQN RIAGENDR RIDAGEYR RIDRETH1 DMDEDUC2 INDHHIN2 BMXWT BMXHT
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
BPXSY1 BPXDI1 BPXPLS SMQ020 SMQ040 SMQ900
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Download the data Covid19_vaccines_Basel. This dataset contains information about the number of Covid-19 vaccinations in the Canton Basel-City between 2 January 2021 and 1 July 2023.
Import the Excel file Covid19_vaccines_Basel.xlsx into R and assign it to the object “covid”.
How many columns does the dataset have?
How many rows?
Get an overview/summary of your data.
What is the median number of vaccinations per day? (variable Vac_perday)
How many missing values do we have for booster vaccinations? (variable Total_vacbooster)
What is the total number of vaccinations in row 14? (variable Total_vac)
# 1. Import the Excel file Covid19_vaccines_Basel.xlsx into R
covid <- read_excel(file.path(d_oridata, "Covid19_vaccines_Basel.xlsx"))
# 2. How many columns does the dataset have?
# 6 columns
dim(covid)
ncol(covid)
# 3. How many rows?
# 915 rows
dim(covid)
nrow(covid)
# 4. Use one of the functions to get an overview of your data
str(covid) # option 1
glimpse(covid) # option 2
# 5. What is the median number of vaccinations per day? (variable Vac_perday)
# 72 vaccinations per day
summary(covid)
# 6. How many missings do we have for booster vaccinations? (variable Total_vacbooster)
# 58
summary(covid)
# 7. What is the total number of vaccinations in row 14? (variable Total_vac)
# 15,806
head(covid, n = 14)
Check for missing values: Summarize missing values in the dataset
Check the number of missing values for all variables in the covid dataset.
Check missings only for total vaccine boosters (variable Total_vacbooster).
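One common way to count missing values is to combine is.na() with colSums() or sum(). The sketch below uses a made-up data frame called toy, not the covid data, so the variable names are placeholders.

```r
# Toy data frame with some missing values (made up for this sketch)
toy <- data.frame(
  id     = 1:5,
  weight = c(70, NA, 82, NA, 65),
  height = c(170, 165, NA, 180, 175)
)

# Number of missing values for every variable
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of missing values for a single variable
na_weight <- sum(is.na(toy$weight))
na_weight
```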
The tidyverse is a collection of R packages designed for data science.
Key Features:
Focus on tidy data principles.
Easy-to-use, consistent syntax.
Handles data manipulation, visualization, and more.
Core Packages:
dplyr (data manipulation)
ggplot2 (data visualization)
tidyr (data tidying)
readr (data import)
tibble (modern data frames)
Simplifies Workflow: Combines common tasks (e.g., cleaning, analyzing, and visualizing data).
Consistent Grammar: Shared principles across packages (e.g., “verbs” like filter, select, mutate in dplyr).
Readable Code: Code becomes intuitive and easier to share or collaborate on.
Built-in Visualization: ggplot2 helps create high-quality, customizable plots.
We want to:
Filter the rows for all who are males (RIAGENDR == 1)
Calculate the average weight.
Base R
Tidyverse
nhanes %>%: Start with the nhanes dataset.
filter(RIAGENDR == 1) %>%: Filter for “RIAGENDR == 1” (male), pass filtered dataset to next function.
summarize(mean_weight = mean(BMXWT, na.rm = TRUE)): Calculate mean of BMXWT column and store it as mean_weight. na.rm = TRUE tells R to ignore/remove missing values for this operation.
How the pipe operator %>% works:
Takes the output from the left: The value or object on the left side of the pipe is passed as the first argument of the function on the right side.
Sends it to the next step: After the function on the right finishes its work, its result is sent as input to the next function in the chain.
Repeat until done: This process continues for as many steps as you chain together.
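The two approaches can be compared side by side in a small sketch. It assumes the dplyr package is installed and uses nhanes_toy, a made-up stand-in with the same column names as nhanes, so that it runs without the data file.

```r
library(dplyr)

# Toy stand-in for nhanes with the same column names (values are made up)
nhanes_toy <- data.frame(
  RIAGENDR = c(1, 2, 1, 2, 1),
  BMXWT    = c(80, 65, NA, 70, 90)
)

# Base R: subset the weights of all males, then take the mean
base_mean <- mean(nhanes_toy$BMXWT[nhanes_toy$RIAGENDR == 1], na.rm = TRUE)
base_mean

# Tidyverse: the same two steps chained with the pipe
result <- nhanes_toy %>%
  filter(RIAGENDR == 1) %>%
  summarize(mean_weight = mean(BMXWT, na.rm = TRUE))
result
```

Both versions give the same number; the piped version reads top to bottom in the order the steps are carried out.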
The most commonly used functions in the tidyverse (mostly from “dplyr”):
select: For selecting columns
filter: For filtering rows based upon condition(s).
arrange: For sorting data
rename: For renaming variables
mutate: For creating / modifying variables.
group_by and summarize: For aggregating data
class conversions (e.g., as.factor): For converting a variable from one class into another one.
if_else and case_when: For categorizing data
relocate: For re-ordering variables in a dataframe
relevel: For setting a reference category.
Selects specific columns from a dataset, to:
reduce the number of columns in a dataset, making it easier to work with
reorganize the order of columns
exclude specific columns.
From now on, we will only work with a subset of variables from NHANES:
SEQN: Respondent sequence number.
RIAGENDR: Gender.
RIDAGEYR: Age.
BMXWT: Weight.
BMXHT: Height.
Basic selection: This is the easiest way to choose your variables - selecting your variables by specific names. This is the dataset we will keep for other examples.
Options: The select-function can do a lot more!
Reorder columns: You can reorder columns by specifying the order
Select columns by range: Use column positions to select columns
Exclude columns using the “-” operator
Select columns by pattern: You can select columns by pattern or name
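The select() options listed above might look like this. The sketch assumes dplyr is installed and uses a made-up stand-in called nhanes_toy; the object name df for the reduced dataset is an assumption borrowed from the later code in this material.

```r
library(dplyr)

# Toy stand-in for nhanes (values are made up)
nhanes_toy <- data.frame(
  SEQN = c(1, 2), RIAGENDR = c(1, 2), RIDAGEYR = c(30, 45),
  BMXWT = c(80, 65), BMXHT = c(180, 165), BPXSY1 = c(120, 130)
)

# Basic selection: choose variables by name
df <- nhanes_toy %>%
  select(SEQN, RIAGENDR, RIDAGEYR, BMXWT, BMXHT)

# Reorder columns: list them in the order you want, everything() adds the rest
reordered <- df %>% select(RIDAGEYR, SEQN, everything())

# Select columns by range: use column positions
first_three <- df %>% select(1:3)

# Exclude columns using the "-" operator
no_height <- df %>% select(-BMXHT)

# Select columns by pattern
body_measures <- df %>% select(starts_with("BMX"))
```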
When writing the code for the following exercises, assign them to the object “ds”.
From nhanes, select the columns SEQN, RIAGENDR, and SMQ040.
Reorder the columns to: SMQ040, RIAGENDR, SEQN.
Exclude the BPXSY1 and BPXDI1 columns.
Select columns that start with BPX.
Bring two operations (1. and 2.) together using the pipe:
a) Select only the columns SEQN, RIAGENDR, and SMQ040
b) Reorder the columns to: SMQ040, RIAGENDR, SEQN
# 1. Select only the columns SEQN, RIAGENDR, and SMQ040.
ds <- nhanes %>%
select(SEQN, RIAGENDR, SMQ040)
# 2. Reorder the columns to: SMQ040, RIAGENDR, SEQN.
ds <- nhanes %>%
select(SMQ040, RIAGENDR, SEQN)
# 3. Exclude the BPXSY1 and BPXDI1 columns.
ds <- nhanes %>%
select(-BPXSY1, -BPXDI1)
# 4. Select columns that start with "BPX".
ds <- nhanes %>%
select(starts_with("BPX"))
# 5. Bring two operations together using the pipe
ds <- nhanes %>%
# a) Select only the columns SEQN, RIAGENDR, and SMQ040
select(SEQN, RIAGENDR, SMQ040) %>%
# b) Reorder the columns to: SMQ040, RIAGENDR, SEQN
select(SMQ040, RIAGENDR, SEQN)
filter(): select rows in a dataset that meet certain conditions, to:
focus on relevant subsets of data for analysis.
exclude rows that don’t meet certain criteria.
explore and validate data by applying logical conditions.
Basic filtering: Filter rows based on a single condition
If you work with variables of the class “factor”, you need to put the value (label) in quotation marks:
Filtering with multiple conditions: Use & for “AND” and | for “OR” to combine conditions
Filtering for missing or non-missing data: To filter rows with missing values or exclude them
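A sketch of these filter() variants, assuming dplyr is installed. The data frame df and its values are made up, and the factor column gender is a hypothetical addition to show the quotation-mark rule.

```r
library(dplyr)

# Toy data with NHANES-style column names (values are made up)
df <- data.frame(
  SEQN     = 1:5,
  RIAGENDR = c(1, 2, 1, 2, 1),
  RIDAGEYR = c(30, 45, 12, 70, 25),
  BMXWT    = c(80, 65, NA, 70, 90)
)

# Basic filtering on a numeric variable
males <- df %>% filter(RIAGENDR == 1)

# Filtering on a factor: the label goes in quotation marks
df_fct <- df %>%
  mutate(gender = factor(RIAGENDR, levels = c(1, 2), labels = c("male", "female")))
females <- df_fct %>% filter(gender == "female")

# Multiple conditions: & means AND, | means OR
adult_males <- df %>% filter(RIAGENDR == 1 & RIDAGEYR >= 18)
male_or_old <- df %>% filter(RIAGENDR == 1 | RIDAGEYR >= 65)

# Rows with a missing weight, and rows without
missing_weight <- df %>% filter(is.na(BMXWT))
with_weight    <- df %>% filter(!is.na(BMXWT))
```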
From nhanes select the following variables: SEQN, RIAGENDR, SMQ040, BPXSY1, BPXDI1. Assign them to the object “ds”.
For the following exercises, use the dataset “ds” and assign any operations to the object “dt”.
Filter rows where RIAGENDR is 2 (female).
Filter rows where BPXSY1 (systolic blood pressure) is greater than 120 and BPXDI1 (diastolic blood pressure) is less than 80.
Combine multiple conditions with |: Filter rows where RIAGENDR is 1 (male) OR BPXSY1 is greater than 140.
# 1. Select the variables SEQN, RIAGENDR, SMQ040, BPXSY1, BPXDI1
ds <- nhanes %>%
  select(SEQN, RIAGENDR, SMQ040, BPXSY1, BPXDI1)
# 2. Filter rows where RIAGENDR is 2 (female).
dt <- ds %>%
  filter(RIAGENDR == 2)
# 3. Filter rows where BPXSY1 (systolic blood pressure) is greater than 120
# and BPXDI1 (diastolic blood pressure) is less than 80.
dt <- ds %>%
  filter(BPXSY1 > 120 & BPXDI1 < 80)
# 4. Combine multiple conditions with |:
# Filter rows where RIAGENDR is 1 (male) OR BPXSY1 is greater than 140.
dt <- ds %>%
  filter(RIAGENDR == 1 | BPXSY1 > 140)
arrange() is used to reorder rows in a dataset based on the values in one or more columns. You can sort data in ascending (default) or descending order. You use arrange to:
organize data for better readability
identify the largest, smallest, or specific range of values
prepare data for summary tables or reports
Basic sorting: Sort rows in ascending order by a single variable
Sorting in descending order: Use desc() to sort rows in descending order
Sorting by multiple variables: You can sort by multiple columns, specifying the order of precedence
Sorting with missing values: By default, arrange() places missing values (NA) at the end. You can also have them on top if needed.
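The sorting options above can be sketched as follows, assuming dplyr is installed. The data frame is made up; the last call shows one possible way to move missing values to the top by sorting on !is.na() first.

```r
library(dplyr)

# Toy data (values are made up)
df <- data.frame(
  SEQN     = 1:4,
  RIAGENDR = c(2, 1, 2, 1),
  BMXWT    = c(65, NA, 80, 70)
)

# Basic sorting: ascending by weight (NA goes last)
asc <- df %>% arrange(BMXWT)

# Descending order (NA still goes last)
dsc <- df %>% arrange(desc(BMXWT))

# Multiple variables: first by gender, then by descending weight within gender
multi <- df %>% arrange(RIAGENDR, desc(BMXWT))

# Missing values on top: FALSE sorts before TRUE
na_first <- df %>% arrange(!is.na(BMXWT), BMXWT)
```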
Use your dataset “ds” again and assign any operations to “dt”:
Sort the dataset by BPXSY1 (systolic blood pressure) in ascending order.
Sort the dataset by SMQ040 (smoking status) in descending order.
Filter rows where BPXDI1 (diastolic blood pressure) is greater than 90 and sort descending by BPXSY1.
# 1. Sort the dataset by BPXSY1 (systolic blood pressure) in ascending order.
dt <- ds %>%
arrange(BPXSY1)
# 2. Sort the dataset by SMQ040 (smoking status) in descending order.
dt <- ds %>%
arrange(desc(SMQ040))
# 3. Filter rows where BPXDI1 (diastolic blood pressure) is greater than 90 and
# sort descending by BPXSY1.
dt <- ds %>%
filter(BPXDI1 > 90) %>%
arrange(desc(BPXSY1))
rename() is used to rename variables in a dataset. It allows you to provide new, meaningful names to columns while preserving the dataset’s structure. You can use it to:
improve the readability and interpretability of your dataset.
standardize variable names for consistency in analysis.
simplify long or complex variable names.
After the rename examples, we will keep working with the original variable names.
Basic: Renaming one variable
Renaming multiple single variables at once
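Both cases follow the pattern new_name = old_name. A minimal sketch, assuming dplyr is installed; the data frame is made up, and the new names ID, gender and age are taken from the labelling example in this material.

```r
library(dplyr)

# Toy data (values are made up)
df <- data.frame(SEQN = 1:2, RIAGENDR = c(1, 2), RIDAGEYR = c(30, 45))

# Renaming one variable: new_name = old_name
one <- df %>%
  rename(ID = SEQN)
colnames(one)

# Renaming several variables in one call
several <- df %>%
  rename(
    ID     = SEQN,
    gender = RIAGENDR,
    age    = RIDAGEYR
  )
colnames(several)
```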
Renaming variables based on patterns: Add a prefix, suffix, or replace variable names that have some common patterns in their name.
# Option 1: Add a prefix to all variables
di <- df %>%
rename_with(~ paste0("NHANES_", .))
# View the result
colnames(di)
# Option 2: Add a suffix "_new" to variables starting with "BMX"
di <- df %>%
rename_with(~ paste0(., "_new"), starts_with("BMX"))
# View the updated column names
colnames(di)
# Option 3: Replace "BMX" with "body" in variable names
di <- df %>%
rename_with(~ sub("^BMX", "body", .), starts_with("BMX"))
# View the updated column names
colnames(di)
Optional - Adding variable labels: If you also want to label your variables, you can use the labelled package.
library(labelled)
# Assigning one variable label:
var_label(dn$gender) <- "Gender of participant"
# Check the label
var_label(dn$gender)
# You can also create multiple variable labels at once by assigning a list to the data frame:
var_label(dn) <- list(
ID = "Participant id",
gender = "Gender of participant",
age = "Age at study",
weight = "Weight in kg",
height = "Height in cm"
)
# Check labels
var_label(dn)
Practice renaming variables using the ds dataframe. Assign all operations to “dt”.
Rename BPXSY1 to systolicBP.
Rename SMQ040 to currentSmoker and BPXDI1 to diastolicBP.
Use rename_with() to replace BPX with pressure in all variables starting with BPX.
# 1. Rename BPXSY1 to systolicBP.
dt <- ds %>%
rename(systolicBP = BPXSY1)
# 2. Rename SMQ040 to currentSmoker and BPXDI1 to diastolicBP.
dt <- ds %>%
rename(
currentSmoker = SMQ040,
diastolicBP = BPXDI1
)
# 3. Use rename_with() to replace BPX with pressure in all variables starting with BPX.
dt <- ds %>%
rename_with(~ sub("^BPX", "pressure", .), starts_with("BPX"))
mutate() is used to create new variables, modify existing ones, and perform calculations on existing data. You can use mutate to:
derive new variables for analysis (e.g., calculate BMI)
recode or categorize variables (e.g., age groups)
perform transformations (e.g., converting units)
Creating new variables: Add a new variable. In this example, add the variable BMI (Body Mass Index: body weight in kg / (body height in m) ^2)
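The BMI formula can be written inside mutate() as sketched below, assuming dplyr is installed and a made-up data frame. NHANES records height (BMXHT) in centimetres, so it is divided by 100 to get metres first.

```r
library(dplyr)

# Toy data (values are made up): weight in kg, height in cm
df <- data.frame(
  SEQN  = 1:3,
  BMXWT = c(80, 65, 50),
  BMXHT = c(180, 165, 158)
)

# BMI = weight in kg / (height in m)^2
df <- df %>%
  mutate(BMI = BMXWT / (BMXHT / 100)^2)
head(df)
```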
Modifying existing variables: In this example, we change age (currently in years) to months. The first version overwrites the existing age variable; creating a new variable instead, as in the second version, is recommended.
# Convert age to months (overwrites RIDAGEYR)
dn <- df %>%
  mutate(RIDAGEYR = RIDAGEYR * 12) %>%
  head() %>%
  print()
# Convert age to months and create a new variable
dn <- df %>%
  mutate(age_months = RIDAGEYR * 12) %>%
  relocate(age_months, .after = RIDAGEYR) %>% # moves age_months directly after RIDAGEYR
  head() %>%
  print()
Categorizing variables using if_else: Especially handy for binary categorizations. You can create more categories by nesting multiple if_else functions.
# Categorize age into two categories
df <- df %>%
  mutate(age_group_binary = if_else(RIDAGEYR >= 18, "adult", "child"))
head(df)
# Categorize age into three categories
df <- df %>%
  mutate(age_group_three =
    if_else(RIDAGEYR < 13, "child",
      if_else(RIDAGEYR >= 13 & RIDAGEYR < 18, "teenager", "adult")
    ))
head(df)
Categorizing variables using case_when: For categorizing variables, case_when is more flexible when there are more than two categories.
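A sketch of case_when() that creates a BMI_category variable, assuming dplyr is installed. The BMI values are made up, and the cut-offs (18.5, 25, 30) are the commonly used adult thresholds, chosen here as an assumption rather than taken from the course script. Conditions are checked from top to bottom and the first match wins.

```r
library(dplyr)

# Toy BMI values (made up), including one missing value
df <- data.frame(BMI = c(17, 22, 27, 32, NA))

df <- df %>%
  mutate(BMI_category = case_when(
    BMI < 18.5 ~ "underweight",
    BMI < 25   ~ "normal",      # reached only if BMI >= 18.5
    BMI < 30   ~ "overweight",  # reached only if BMI >= 25
    BMI >= 30  ~ "obese"
  ))
df
```

Rows that match none of the conditions, such as the missing BMI here, get NA.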
Use your dataset “ds” again and assign all operations to this dataset (ds).
Create a new variable called pulse_pressure, calculated as BPXSY1 - BPXDI1.
Create a binary variable smoker_status using if_else():
0 if SMQ040 is 3 (not at all), 1 otherwise (smoking every day / some days).
Categorize BPXSY1 (systolic blood pressure) into three groups using case_when().
< 120: “normal”
120-139: “elevated”
>= 140: “hypertension”
# 1. Create Pulse_Pressure variable
ds <- ds %>%
mutate(pulse_pressure = BPXSY1 - BPXDI1)
# 2. Create Smoker_Status using if_else()
ds <- ds %>%
mutate(smoker_status = if_else(SMQ040 == 3, 0, 1))
# 3. Categorize systolic blood pressure using case_when()
ds <- ds %>%
mutate(bp_category = case_when(
BPXSY1 < 120 ~ "normal",
BPXSY1 >= 120 & BPXSY1 < 140 ~ "elevated",
BPXSY1 >= 140 ~ "hypertension"
))
Class conversions are essential to transform variable types. You use them to:
prepare data for analysis (e.g., converting strings to factors for categorical variables).
fix data import issues (e.g., when numeric values are read as characters).
customize variable types for specific functions (e.g., some models require factors).
String (character) to factor: To convert a character/string into a factor variable.
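A minimal sketch of the basic conversion with as.factor(), assuming dplyr is installed and a made-up character column:

```r
library(dplyr)

# Toy data with a character variable (values are made up)
df <- data.frame(
  BMI_category = c("normal", "obese", "underweight", "normal"),
  stringsAsFactors = FALSE
)
class(df$BMI_category)

# Convert the character variable into a factor
dn <- df %>%
  mutate(BMI_category = as.factor(BMI_category))
class(dn$BMI_category)
levels(dn$BMI_category)
```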
R orders the levels alphabetically if not specified otherwise. We can specify the level order using an adapted code:
df <- df %>%
mutate(BMI_category = factor(BMI_category,
levels = c("underweight", "normal", "overweight", "obese")))
# Check the class of the variable
class(df$BMI_category)
levels(df$BMI_category)
# Optional: If you want to set a different reference category, you can use the relevel() function
dn <- df %>%
mutate(BMI_category = relevel(BMI_category, ref = "normal"))
levels(dn$BMI_category)
Numeric to factor: Useful for categorical variables stored as numbers.
# Example: Convert Gender (numeric) to a factor
dn <- df %>%
mutate(RIAGENDR = as.factor(RIAGENDR))
class(dn$RIAGENDR)
levels(dn$RIAGENDR)
# Optionally, you can also assign value labels
df <- df %>%
mutate(RIAGENDR = factor(RIAGENDR,
levels = c(1, 2),
labels = c("male", "female")))
# Check the result
class(df$RIAGENDR)
levels(df$RIAGENDR)
nlevels(df$RIAGENDR) # Gives you the number of levels
Factor to numeric: When you import datasets, a numeric variable may be recognized as a factor variable, which you then have to convert:
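A sketch of this conversion with made-up values. Calling as.numeric() directly on a factor returns the internal level codes, so the factor is converted to character first.

```r
# A numeric variable that ended up as a factor (values are made up)
height_fct <- factor(c("170", "165", "180"))

# Pitfall: this returns the level codes, not the original numbers
codes <- as.numeric(height_fct)
codes

# Convert to character first, then to numeric
height_num <- as.numeric(as.character(height_fct))
height_num
class(height_num)
```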
Work with your dataframe “ds” again and assign all operations to it.
Convert smoker_status from numeric to a factor with the following labels:
0: “no smoker”
1: “current smoker”
Convert the variable bp_category from a string to a factor variable. Order the levels as follows: “normal”, “elevated”, “hypertension”.
# 1. Convert smoker_status from numeric to a factor with the following labels:
ds <- ds %>%
  mutate(smoker_status = factor(smoker_status,
    levels = c(0, 1),
    labels = c("no smoker", "current smoker")))
# 2. Convert the variable bp_category from a string to a factor variable. Order the levels as follows: "normal", "elevated", "hypertension".
ds <- ds %>%
  mutate(bp_category = factor(bp_category,
    levels = c("normal", "elevated", "hypertension")))
Now that you have prepared your dataset for analysis, we can run an exploratory data analysis. We do this to uncover patterns, spot anomalies, and summarize the key characteristics of the data. It usually involves:
Descriptive statistics: Summarizing individual variables (e.g., mean and median)
Visualizations: Exploring distributions and relationships between variables (e.g., boxplot, correlation matrix)
Group comparisons: Comparing metrics across different categories (e.g., cross-tabulations, mean by group).
We will use different packages for exploratory data analysis including tidyverse, which you already know, and the packages “table1” and “summarytools”.
We will calculate some basic descriptive statistics for numeric and factor variables. There are various ways of doing this in R; we will show you a couple of options with Base R, tidyverse, and summarytools.
Base R
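In Base R, `summary()` and the individual functions `min()`, `mean()`, `median()`, and `max()` cover the same ground. A minimal sketch, using the built-in `cars` dataset as a stand-in for `df` (the choice of dataset and the object names `speed_stats` and `all_stats` are assumptions for illustration):

```r
# Quick overview of one numeric variable
summary(cars$speed)

# The same statistics one by one; na.rm = TRUE ignores missing values
speed_stats <- c(
  Min    = min(cars$speed, na.rm = TRUE),
  Mean   = mean(cars$speed, na.rm = TRUE),
  Median = median(cars$speed, na.rm = TRUE),
  Max    = max(cars$speed, na.rm = TRUE)
)
speed_stats

# Several variables at once: apply the same function to each column
all_stats <- sapply(cars[, c("speed", "dist")], function(x) {
  c(Min = min(x, na.rm = TRUE),
    Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE))
})
all_stats
```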
tidyverse
# calculate basic descriptive statistics for one variable
df %>%
summarize(
Min_Age = min(RIDAGEYR, na.rm = TRUE),
Mean_Age = mean(RIDAGEYR, na.rm = TRUE),
Median_Age = median(RIDAGEYR, na.rm = TRUE),
Max_Age = max(RIDAGEYR, na.rm = TRUE)
)
# calculate basic descriptive statistics for multiple variables
df %>%
summarize(
across(
c(RIDAGEYR, BMXWT, BMXHT),
list(
Min = ~ min(.x, na.rm = TRUE),
Mean = ~ mean(.x, na.rm = TRUE),
Median = ~ median(.x, na.rm = TRUE),
Max = ~ max(.x, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)
)
# The .names argument controls how the new column names are generated:
# {.col} refers to the variable name (e.g., Variable1).
# {.fn} refers to the function name (e.g., Min, Mean, etc.).
# This ensures the output columns have clear and unique names.
summarytools
If you want to create a more comprehensive overview of both numeric and factor variables, the packages summarytools and table1 can be very helpful.
table1: This is a useful function to create an overview of various variable types, in a format that you can also export to a Word document.
summarytools: The function “dfSummary” creates a comprehensive overview of a dataframe, including basic descriptive statistics, value codings, histograms / bar plots, and missing values.
Use your dataframe ds for the following exercises.
Calculate basic descriptive statistics (minimum, mean, median, and maximum) for the following numeric variables: BPXSY1 (systolic blood pressure), BPXDI1 (diastolic blood pressure).
Summarize frequencies and proportions for bp_category (blood pressure category) and smoker_status.
Base R:
# 1. Calculate basic descriptive statistics (minimum, mean, median, and maximum) for the following numeric variables: BPXSY1 (systolic blood pressure), BPXDI1 (diastolic blood pressure).
# BPXSY1 (Systolic blood pressure)
summary(ds$BPXSY1)
min(ds$BPXSY1, na.rm = TRUE)
mean(ds$BPXSY1, na.rm = TRUE)
median(ds$BPXSY1, na.rm = TRUE)
max(ds$BPXSY1, na.rm = TRUE)
# BPXDI1 (Diastolic blood pressure)
summary(ds$BPXDI1)
min(ds$BPXDI1, na.rm = TRUE)
mean(ds$BPXDI1, na.rm = TRUE)
median(ds$BPXDI1, na.rm = TRUE)
max(ds$BPXDI1, na.rm = TRUE)
# 2. Summarize frequencies and proportions for bp_category (blood pressure category) and smoker_status.
# Frequency and proportions for bp_category
table(ds$bp_category)
prop.table(table(ds$bp_category))
# Frequency and proportions for smoker_status
table(ds$smoker_status)
prop.table(table(ds$smoker_status))
Tidyverse:
# 1. Calculate basic descriptive statistics (minimum, mean, median, and maximum) for the following numeric variables: BPXSY1 (systolic blood pressure), BPXDI1 (diastolic blood pressure).
## tidyverse
ds %>% # specifying for each variable on its own
summarize(
Min_Systolic = min(BPXSY1, na.rm = TRUE),
Mean_Systolic = mean(BPXSY1, na.rm = TRUE),
Median_Systolic = median(BPXSY1, na.rm = TRUE),
Max_Systolic = max(BPXSY1, na.rm = TRUE),
Min_Diastolic = min(BPXDI1, na.rm = TRUE),
Mean_Diastolic = mean(BPXDI1, na.rm = TRUE),
Median_Diastolic = median(BPXDI1, na.rm = TRUE),
Max_Diastolic = max(BPXDI1, na.rm = TRUE)
)
ds %>% # using summarize across
summarize(
across(
c(BPXSY1, BPXDI1),
list(
Min = ~ min(.x, na.rm = TRUE),
Mean = ~ mean(.x, na.rm = TRUE),
Median = ~median(.x, na.rm = TRUE),
Max = ~max(.x, na.rm = TRUE)
),
.names = "{.col}_{.fn}")
)
# 2. Summarize frequencies and proportions for bp_category (blood pressure category) and smoker_status.
## tidyverse
# Frequency and proportions for bp_category
ds %>%
count(bp_category) %>%
mutate(Proportion = n / sum(n))
# Frequency and proportions for smoker_status
ds %>%
count(smoker_status) %>%
mutate(Proportion = n / sum(n))
Summarytools and table1:
# 1. Calculate basic descriptive statistics (minimum, mean, median, and maximum) for the following numeric variables: BPXSY1 (systolic blood pressure), BPXDI1 (diastolic blood pressure).
descr(ds$BPXSY1)
descr(ds$BPXDI1)
# 2. Summarize frequencies and proportions for bp_category (blood pressure category) and smoker_status.
freq(ds$bp_category)
freq(ds$smoker_status)
## Tasks 1 & 2 together
# using table1
table1(~ BPXSY1 + BPXDI1 + bp_category + smoker_status, data = ds)
# using summarytools
dfSummary(ds)
While descriptive statistics are helpful, any exploratory data analysis also requires visual inspection to
explore variable distributions
assess the spread of data and outliers
examine relationships between two numeric variables
visualize categorical data
Note on visualizations: Here, we will use Base R for simplicity. However, tidyverse (ggplot2) can also be used.
Histograms display the distribution of a numeric variable by dividing it into intervals (bins) and showing the frequency of observations in each interval.
Boxplots summarize the distribution of a numeric variable, highlighting median, quartiles, and potential outliers.
Scatterplots are used to explore relationships between two numeric variables.
Bar charts are used to visualize frequencies of categorical variables.
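The four plot types can be sketched with the Base R functions `hist()`, `boxplot()`, `plot()`, and `barplot()`. This is a minimal example on the built-in `iris` dataset, chosen here only as a stand-in for the course data; the titles, axis labels, and colors are assumptions for illustration:

```r
# Histogram: distribution of one numeric variable
hist(iris$Sepal.Length,
     main = "Sepal length",
     xlab = "Sepal length (cm)")

# Boxplot: numeric variable stratified by a factor (formula: numeric ~ factor)
boxplot(Sepal.Length ~ Species,
        data = iris,
        main = "Sepal length by species",
        xlab = "Species",
        ylab = "Sepal length (cm)",
        col = c("blue", "yellow", "green"))

# Scatterplot: relationship between two numeric variables
plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Sepal length vs. petal length",
     xlab = "Sepal length (cm)",
     ylab = "Petal length (cm)")

# Bar chart: frequencies of a categorical variable (tabulate first)
species_counts <- table(iris$Species)
barplot(species_counts,
        main = "Observations per species",
        ylab = "Frequency")
```

Note that `barplot()` expects counts, which is why the factor is passed through `table()` first.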
Create a histogram for BPXSY1 (systolic blood pressure).
Create a boxplot for BPXSY1 (systolic blood pressure) and stratify by smoker_status. Title the figure “Blood pressure and smoker status”. Label the x-axis “Smoking status”, the y-axis “Systolic blood pressure”. Color the boxplots in blue and yellow.
Create a scatterplot for BPXSY1 (systolic blood pressure) and BPXDI1 (diastolic blood pressure).
Create a bar chart for bp_category (blood pressure category).
# 2. Create a boxplot for BPXSY1 (systolic blood pressure) and stratify by smoker_status. Title the figure "Blood pressure and smoker status". Label the x-axis "Smoking status", the y-axis "Systolic blood pressure". Color the boxplots in blue and yellow.
boxplot(
BPXSY1 ~ smoker_status,
data = ds,
main = "Blood pressure and smoker status",
xlab = "Smoking status",
ylab = "Systolic blood pressure",
col = c("blue", "yellow")
)
Sometimes you may want to explore descriptives separately for groups. We will go through how to:
group data by one or more categorical variables
calculate descriptive statistics (e.g., mean, median) for numeric variables within groups
optionally include statistical tests for group differences
tidyverse and table1 are two helpful packages for calculating descriptives for each group.
tidyverse: In tidyverse, you can use the group_by() function to group data by a factor variable (e.g., age group), and then calculate descriptives for each group.
table1: You can also use table1 to stratify calculations by certain groups, using the “|” operator.
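A minimal sketch of the group_by() pattern, assuming the dplyr package (part of tidyverse) is installed. It uses the built-in `iris` dataset as a stand-in for the course data; the object name `by_species` and the column names in the result are assumptions for illustration:

```r
library(dplyr)

by_species <- iris %>%
  filter(!is.na(Sepal.Length)) %>%   # drop rows with a missing value
  group_by(Species) %>%              # one group per factor level
  summarize(
    mean_length   = mean(Sepal.Length),
    median_length = median(Sepal.Length),
    count         = n()              # number of rows in each group
  )

by_species
```

The result has one row per level of the grouping variable, which makes group differences easy to compare side by side.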
Use tidyverse to:
filter for participants without missing values for BPXSY1
group the dataset by bp_category
calculate the mean and median for BPXSY1 (systolic blood pressure)
Using table1, create a summary table for pulse_pressure stratified by RIAGENDR.
# 1. Use tidyverse to filter for participants without missing values for BPXSY1, group the dataset by bp_category, and calculate the mean and median for BPXSY1 (systolic blood pressure)
ds %>%
filter(!is.na(BPXSY1)) %>%
group_by(bp_category) %>%
summarize(
mean_Systolic = mean(BPXSY1, na.rm = TRUE),
median_Systolic = median(BPXSY1, na.rm = TRUE),
count = n()
)
# 2. Using table1, create a summary table for pulse_pressure stratified by RIAGENDR
table1(~ pulse_pressure | RIAGENDR,
data = ds)
To assess associations between categorical variables, you need contingency tables (cross-tabulations). To create them, you can again work with Base R, table1, and summarytools.
Base R
underweight normal overweight obese
adult 100 1382 1725 2227
child 1262 448 101 44
teenager 97 365 127 127
underweight normal overweight obese
adult 0.01840265 0.25432462 0.31744571 0.40982702
child 0.68032345 0.24150943 0.05444744 0.02371968
teenager 0.13547486 0.50977654 0.17737430 0.17737430
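Counts and row proportions like the two tables above come from `table()` and `prop.table()`. A minimal sketch on a small made-up data frame (the data frame `toy` and its values are assumptions for illustration, not the course data):

```r
# Hypothetical toy data: one row per participant
toy <- data.frame(
  age_group = c("adult", "adult", "adult", "child", "child", "teenager"),
  bmi_cat   = c("normal", "obese", "obese", "underweight", "normal", "normal")
)

# Cross-tabulation: rows = age group, columns = BMI category
counts <- table(toy$age_group, toy$bmi_cat)
counts

# Row proportions: margin = 1 makes each row sum to 1
row_props <- prop.table(counts, margin = 1)
row_props
```

Using `margin = 2` instead would give column proportions.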
table1
# BMI category by age group
table1(~ BMI_category | age_group_three,
render.missing = NULL, # take this out if you also want to see the missings
data = df)
| | adult (N=5856) | child (N=2647) | teenager (N=751) | Overall (N=9254) |
|---|---|---|---|---|
| BMI_category | | | | |
| underweight | 100 (1.7%) | 1262 (47.7%) | 97 (12.9%) | 1459 (15.8%) |
| normal | 1382 (23.6%) | 448 (16.9%) | 365 (48.6%) | 2195 (23.7%) |
| overweight | 1725 (29.5%) | 101 (3.8%) | 127 (16.9%) | 1953 (21.1%) |
| obese | 2227 (38.0%) | 44 (1.7%) | 127 (16.9%) | 2398 (25.9%) |
summarytools
Cross-Tabulation, Row Proportions
BMI_category * age_group_three
Data Frame: df
-------------- ----------------- -------------- -------------- ------------- ---------------
age_group_three adult child teenager Total
BMI_category
underweight 100 ( 6.9%) 1262 (86.5%) 97 ( 6.6%) 1459 (100.0%)
normal 1382 (63.0%) 448 (20.4%) 365 (16.6%) 2195 (100.0%)
overweight 1725 (88.3%) 101 ( 5.2%) 127 ( 6.5%) 1953 (100.0%)
obese 2227 (92.9%) 44 ( 1.8%) 127 ( 5.3%) 2398 (100.0%)
<NA> 422 (33.8%) 792 (63.4%) 35 ( 2.8%) 1249 (100.0%)
Total 5856 (63.3%) 2647 (28.6%) 751 ( 8.1%) 9254 (100.0%)
-------------- ----------------- -------------- -------------- ------------- ---------------
# You can also drop missing, use column percentages and add a chi-square test
ctable(df$BMI_category, df$age_group_three,
useNA = "no",
prop = "c",
chisq = TRUE)
Cross-Tabulation, Column Proportions
BMI_category * age_group_three
Data Frame: df
-------------- ----------------- --------------- --------------- -------------- ---------------
age_group_three adult child teenager Total
BMI_category
underweight 100 ( 1.8%) 1262 ( 68.0%) 97 ( 13.5%) 1459 ( 18.2%)
normal 1382 ( 25.4%) 448 ( 24.2%) 365 ( 51.0%) 2195 ( 27.4%)
overweight 1725 ( 31.7%) 101 ( 5.4%) 127 ( 17.7%) 1953 ( 24.4%)
obese 2227 ( 41.0%) 44 ( 2.4%) 127 ( 17.7%) 2398 ( 30.0%)
Total 5434 (100.0%) 1855 (100.0%) 716 (100.0%) 8005 (100.0%)
-------------- ----------------- --------------- --------------- -------------- ---------------
----------------------------
Chi.squared df p.value
------------- ---- ---------
4627.584 6 0
----------------------------
Using your dataset ds, create a contingency table of bp_category and smoker_status.
# Cross-tabulation with ctable of bp_category and smoker_status
ctable(ds$bp_category, ds$smoker_status, prop = "c", useNA = "no", chisq = TRUE)
Cross-Tabulation, Column Proportions
bp_category * smoker_status
Data Frame: ds
-------------- --------------- --------------- ---------------- ---------------
smoker_status no smoker current smoker Total
bp_category
normal 375 ( 33.3%) 359 ( 40.9%) 734 ( 36.6%)
elevated 439 ( 39.0%) 340 ( 38.8%) 779 ( 38.9%)
hypertension 313 ( 27.8%) 178 ( 20.3%) 491 ( 24.5%)
Total 1127 (100.0%) 877 (100.0%) 2004 (100.0%)
-------------- --------------- --------------- ---------------- ---------------
----------------------------
Chi.squared df p.value
------------- ---- ---------
19.159 2 1e-04
----------------------------
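The chi-square test reported by ctable() can be cross-checked in Base R with `chisq.test()`. A minimal sketch that re-enters the counts from the table above by hand (the object names `bp_by_smoking` and `test_result` are assumptions for illustration); it should reproduce the statistic of about 19.159 with 2 degrees of freedom:

```r
# Counts copied from the cross-tabulation above
bp_by_smoking <- matrix(
  c(375, 359,
    439, 340,
    313, 178),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    bp_category   = c("normal", "elevated", "hypertension"),
    smoker_status = c("no smoker", "current smoker")
  )
)

# Pearson chi-square test of independence
test_result <- chisq.test(bp_by_smoking)
test_result$statistic
test_result$parameter   # degrees of freedom
test_result$p.value
```

With the raw data at hand, `chisq.test(table(ds$bp_category, ds$smoker_status))` would run the same test without retyping the counts.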